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^^ I Abstract The adaptive evolution of a population under the influence of mu- 

tation and selection is strongly influenced by the structure of the underlying 
fitness landscape, which encodes the interactions between mutations at dif- 
ferent genetic loci. Theoretical studies of such landscapes have been carried 
out for several decades, but only recently experimental fitness measurements 
encompassing all possible combinations of small sets of mutations have be- 
come available. The empirical studies have spawned new questions about 
the accessibility of optimal genotypes under natural selection. Depending on 
population dynamic parameters such as mutation rate and population size, 
evolutionary accessibility can be quantified through the statistics of acces- 
sible mutational pathways (along which fitness increases monotoncially), or 
through the study of the basin of attraction of the optimal genotype under 
greedy (steepest ascent) dynamics. Here we investigate these two measures 
of accessibility in the framework of Kauffman's LK-mode\, a paradigmatic 
family of random fitness landscapes with tunable ruggedness. The key pa- 
rameter governing the strength of genetic interactions is the number K of 
interaction partners of each of the L sites in the genotype sequence. In gen- 
eral, accessibility increases with increasing genotype dimensionality L and 
decreases with increasing number of interactions K. Remarkably, however, 
we find that some measures of accessibility behave non-monotonically as a 
function of K, indicating a special role of the most sparsely connected, non- 
trivial cases K — 1 and 2. The relation between models for fitness landscapes 
and spin glasses is also addressed. 
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1 Introduction and background 

The basic idea of the theory of evolution as presented by Charles Darwin 
PQ is that heritable phenotypic variability and the subsequent competition 
between reproducing organisms (possibly aided by spacial separation of habi- 
tats) shape the diversity of life as it can be observed today. This theory, purely 
descriptive at first, was cast in a quantitative form by Haldane, Fisher and 
Wright [2 , 3 , : 4J . This quantification, which is now known as the Modern Syn- 
thesis [5] of evolution founded the field of mathematical population genetics, 
which has remained an active area of research ever since. As the mutations, 
i.e. changes to the hereditary material of an organism, which give rise to the 
diversity in the population, cannot be controlled, they are modeled as random 
events within the Modern Synthesis. Furthermore, the intricate mechanisms 
by which the hereditary (or 'genetic') information is processed are accounted 
for in the form of stochastic models, thus methods familiar to statistical 
physicists find a natural field of application in this context. 

Within the Modern Synthesis, populations are subject to four evolution- 
ary forces: 

a) Selection by relative average growth rate of mutant populations, which 
is the mechanism by which different mutants compete. Selective forces 
are quantified by fitness, which is a measure of reproductive success and 
can be taken to be proportional to the number of offspring an individual 
leaves in the next generation. 

b) Genetic drift, which represents random demographic fluctuations due to 
finite population sampling effects around the deterministic process de- 
scribed by selection alonqj, 

c) mutations, which are random changes in the hereditary material, and 

d) recombination, which corresponds to two individuals combining (parts of) 
their genetic material in one common offspring. 

In this paper, we will focus specifically on the influence of selection on the 
adaptive fate of a population. 

During the last decade or so, the advent of modern technologies has made 
it possible to directly manipulate or at least inspect the genetic material of 
model organisms at a molecular level, thus putting the theoretical models of 
adaptation that have been at the center of attention thus far to the test. Of 
particular interest is the question of how different mutations interact in their 
effects on fitness. To give an important example, in [B] it was found that five 
mutations in the TEM-I j3- lactamase gene in the bacterium Escherichia coli 
together increase the resistance against a particular antibiotic by a factor 
on the order of I0 5 . Besides the obvious clinical relevance of this finding, it 
raises an interesting theoretical question [71I51IMTU] : Can such a combination 
of mutations emerge in natural populations, where the five mutations would 
have to appear and spread one at a time? In other words, is the optimal 

1 In the terminology of stochastic processes, this is the 'diffusive' term in the 
Fokker-Planck equation describing the probability distribution of mutant frequen- 
cies, while the term describing the deterministic behavior is usually called 'drift'. 
Since in almost all that follows, only selection will be considered, this should not 
lead to confusion. 



combination of mutations accessible to the population? In this paper, we will 
address this problem in two specific regimes of adaptive dynamics. 



1.1 Fitness landscapes and adaptive regimes 

We start from the common assumption that all the organism's properties 
are determined (in a given, fixed environment) by its hereditary information, 
which is represented as a sequence a — {o\, &%, . . . , ox} of length L called the 
genotype. Real genomes have, at the level of DNA base pairs, four different 
possible letters or 'alleles' at each of the L sites, i.e. 07 £ {G,A,T,C}. If 
the proteins produced after transcription are modeled (llj . the number of 
possible alleles goes up to 21 (the number of amino acids). However, in many 
cases the number of mutations in a population is small, such that at most 
two different alleles exist at a given genomic site and the genotype can be 
modeled as a binary sequence. Thus the fitness F (like all other properties) 
is a mapping from the Boolean hypercube to the real numbers. A natural 
measure of distance between any two states is the Hamming Distance (HD) 
counting the number of point mutations separating two genotypes, 

L 

d(o-i,o- 2 ) = ^(l-^ ll! ,<7 2 , i ) (1) 

where Kronecker's delta is defined by 5 XtV = 1 if x = y and 5 x _ y = other- 
wise. The dynamics of a population on a given fitness landscape (FL) F(cr) 
can be visualized as a hill climbing process, the statistical features of which 
depend on the mutation rate \x per genome, the population size N, and the 
scale s of typical fitness differences. In each new generation, every individual 
will spawn a number of offspring that is on average proportional to that in- 
dividual's fitness. The offspring will, up to mutations, have the same genetic 
configuration with the same fitness as the parent. Thus the population in gen- 
eral forms a cloud in sequence space. An important early result of theoretical 
population genetics is that the strength of demographic fluctuations ('drift') 
is governed by the inverse population size \/N |12pi3j . The parameters Ns 
and N n therefore gauge the strength of selection and mutations relative to 
the effect of drift, giving rise to different dynamical regimes |14) . Two regimes 
of particular interest in this paper will be described in the following. 

1.1.1 The strong selection /weak mutation regime 

If the product of population size and mutation rate per individual genome 
is low, Nfi <C 1, mutations are rare events and the population is genetically 
homogeneous ('monomorphic') most of the time. If furthermore the selec- 
tion strength is much greater than the strength of demographic fluctuations, 
Ns ^> 1, mutations that lower fitness (called 'deleterious') are rapidly driven 
to extinction. This evolutionary regime is called the 'strong selection/weak 
mutation (SSWM)' regime of adaptation fTMTB] . In this regime, the pop- 
ulation can only move uphill in fitness, and only between states of HD 1. 
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Fig. 1 (Color online) Two examples of empirical fitness landscapes obtained from 
growth rate measurements of different mutants of the fungus Aspergillus niger [101 
23.24J. The arrows between neighboring configurations point in the direction of 
fitness increase. Local maxima, i.e. configurations without outbound arrows are un- 
derlined. The colors indicate the basins of attraction under steepest ascent (greedy) 
dynamics. In the left FL the number of accessible paths n from the antipodal geno- 
type {1, 1, 1, 1} to the global fitness maximum {0, 0, 0, 0} is n = while in the right 
example n = 6 



Such a move is then called an 'accessible step'. Between mutational steps the 
population is located at one point in sequence space. 

1.1.2 The greedy adaptive regime 

If selection is still much stronger than drift, but the mutation rate and pop- 
ulation size are such that Nfi ~I>1 and N/j 2 <C 1, all L states at HD 
1 of the most populated state are generated in one generation, but almost 
none at greater mutational distance. Since selection is strong, the population 
will in the next generation be dominated by the most fit of the available L 
single mutants and continue from that state on. Thus the population can 
still be localized to a well-defined point in sequence space and the adaptive 
process takes the route of steepest fitness increase in each step, until a (local 
or global) fitness maximum is reached. This regime of adaptation is called 
'greedy' [UHMMI!] ■ 



1.2 Topographic properties of fitness landscapes 

There are many ways in which the ruggedness or complexity of a fitness land- 
scape can be quantified. Possible measures that have been proposed include 
the number of local maxima, the length of uphill adaptive walks, or the de- 
cay of fitness correlations along a random walk through the landscape [20l 
|2"T1I2"2"] . The present work is based on two particular topographic observables 
which are motivated by the two adaptive regimes described in the preceding 
sections. 

1.2.1 Accessible paths 



Within the SSWM regime, the notion of evolutionary accessibility of the 
optimal genotype can be quantified in a straightforward way j7]: Since the 



population is constrained to move uphill in single mutations steps, we say that 
a pathway connecting two genotypes is accessible if each mutation confers 
a selective advantage, i.e., if fitness increases monotonically along the path. 
This definition is closely related to the notion of reachability introduced in 
[25] . Moreover, since mutations are rare, only the shortest (direct) pathways 
connecting two genotypes will be considered^ ■ For two genotypes at HD d, 
there are d\ such pathways corresponding to the different orderings in which 
the mutations separating them can occur [261 . 

To obtain a global measure of accessibility, we focus here on pathways 
to the global fitness maximum which span the entire landscape [51110] . If the 
global maximum is located at a sequence <r max , there is a unique 'antipo- 
dal' sequence cr ant i such that d{(T max , <T an u) = L, and L\ shortest pathways 
connecting it to cr max . Our rationale for using the antipodal sequence as a 
starting point is that these paths are the longest possible direct evolutionary 
trajectories on the landscape, and as such their accessibility provides a lower 
bound on the accessibility of typical paths starting, for example, at a ran- 
domly chosen genotypeo 10]. On each realization of a FL, a certain number 
n of these pathways will be accessible in the sense defined above, see fig. Q] 
for two examples taken from an empirical data set for the filamentous fun- 
gus Aspergillus niger [10 , 23 ,24 . Evolutionary accessibility for an ensemble 
of FL's can thus be characterized by the distribution 7T£,(n), the probability 
that in a given realization there will be exactly n accessible paths. 

While evolution is constrained to accessible paths only in the SSWM 
regime, as a topographical feature of a FL, they can be defined independent 
of any adaptive regime. In this paper, we will analyze the behavior of the 
distribution til (n) for a particular model of fitness landscapes to be defined 
in sect. [21 

The study of accessible paths on FL's can be seen as a kind of percola- 
tion problem [25] on the Boolean hypercube [29,30 , with some important 
differences. In the (bond) percolation problem, individual bonds between ad- 
jacent states are open or closed independently of the others, while in the FL 
context, there is an effective correlation between bonds: Since a bond on a 
FL is open if the fitness increases when the bond is traversed in the direction 
towards the global fitness maximum, the states of two bonds adjacent to the 
same sequence are correlated via the fitness of that state. Moreover, the fact 
that the accessibility of a bond is determined by the difference of a globally 
defined fitness function implies a non-local gradient condition (the arrows in 
fig. Q] cannot form any loops) . 

1.2.2 Basins of attraction 

Assuming a continuous range of fitness values (so that ties cannot occur), 
evolutionary trajectories under greedy adaptive dynamics are completely de- 

2 For an empirical analysis of accessible pathways that includes the possibility of 
mutational reversions see [27| . 

3 At least for the HoC and RMF models described below, it can be shown that 
the statistics of accessible paths to the global maximum does not depend on the 
choice of the starting point. 



terministic and terminate at local fitness maxima. Thus each genotype er is 
uniquely associated to a fitness maximum that will be reached by a greedy 
trajectory starting at <x. The set of all configurations leading to a given max- 
imum is called the 'Basin of Attraction' (BoA) of that maximum (see fig. Q] 
for illustration). Under greedy dynamics, the importance of a local optimum 
can be measured by the size of its BoA. In particular, a landscape where 
the BoA of the global fitness maximum (GM) cr max contains a non- vanishing 
fraction of the FL for large L can be said to be dominated by the GM. 

Much like accessible paths, the BoA's are of direct relevance to the adap- 
tation of a population only if the dynamics take place in the greedy regime 
defined above, but the distribution of basin sizes can still be considered as a 
topographical feature and analyzed independently of the adaptive dynamics. 
Here, the distribution pl^gm) of the size bcM of the BoA of the GM and its 
expectation value (ogm) are studied in the framework of the model of FL's 
to be introduced in sect. [2j 



1.3 Fitness landscapes and spin glasses 

Models of spin glasses (SG's) were originally introduced in the context of dis- 
ordered magnets [21] and are now used as a paradigm for complex systems 
across a broad range of disciplines, e.g. in information theory and computa- 
tional optimization [3"!?ll3"3"] . There is an obvious close similarity between SG's 
and FL's, since in both cases a real number (energy or fitness) is associated to 
each configuration of a binary (spin) chain of a given length, and the interest 
is in complex landscapes with many local minima or maxima |34j . 

However, the two classes of systems differ importantly in that the instan- 
taneous state of a spin glass is uniquely described by a single configuration, 
whereas a biological population generally occupies a cloud in genotype space. 
A direct comparison between the dynamics of the two systems is therefore 
meaningful only when mutations are rare, N[i <C 1, such that the population 
cloud condenses to a single point in genotype space and adaptive steps occur 
one 'spin flip' at a time. In this regime it can be shown that the effective tran- 
sition rates between neighboring configurations satisfy detailed balance with 
respect to an equilibrium measure of Boltzmann form, P(cr) ~ e XNF ( cr ) ^ w ith 
fitness F taking the role of (negative) energy and population size N acting 
as an inverse temperaturq^ [351136] . 

In the spin glass context, the notion of evolutionary accessibility formu- 
lated above corresponds to the question of whether the ground state can be 
reached under a zero temperature dynamics, where only moves that lower 
the system energy are accepted. Similarly, the size of the BoA of the GM is a 
measure for the probability that the ground state can be reached through a 
steepest descent algorithm, which is of interest in the context of optimization 
theory [55] . 

4 The factor A = 1 or 2 depending on the population dynamic model, see [36] . 



2 Fitness landscape models 

Even though the pace at which new empirical FL's become available has 
increased greatly since the earliest examples {51123(124] with several having 
been published in the last year alone 10,37,38,39,40 , the number of FL's 
from the same system remains very limited. Furthermore, FL's from different 
organisms are difficult to compare as for example the way in which the muta- 
tions are chosen differs strongly [JO] . Thus statements about typical behavior 
with regard to evolutionary accessibility must rely on theoretical models. 

Many such models have been proposed, each based on different intuitions 
about the biological mechanisms that give rise to fitness. For example, the 
'holey landscape' model assumes that mutations among viable genotypes are 
effectively neutral, forming neutral networks of constant fitness interspersed 
by regions of lethality [2"M3Ti] . The question of evolutionary accessibility then 
reduces to the problem of percolation on the hypercube which, as we have 
argued above, is rather different from the problem addressed in the present 
paper. In the following we define the two classes of fitness landscape models 
that will be discussed in the remainder of this paper. 



2.1 House of Cards models 

A very influential model is the 'House of Cards' (HoC) model first proposed 
by Kingman J4T1142J . Based on the notion that the genome is a finely tuned 
machinery and any change (mutation) to it has the same effect as pulling 
one card from a house of cards, namely that everything has to be rebuilt 
from scratch, the fitness values are assumed to be independent, identically 
distributed random variables (i.i.d. RV's). This is equivalent to Derrida's 
random energy model (REM) of spin glasses [4HP4] . 

In contrast, the 'Mt. Fuji model' [3D] (named after the famous volcano in 
Japan) is a model within which fitness values are strictly ordered by HD to 
the GM, i.e. a state at HD k from the GM has lower fitness than any state 
at HD k — 1 and greater fitness than any state at HD k + 1. Such a fitness 
landscape has smooth flanks, no local optima and all LI paths are accessible. 

In terms of ruggedness, the HoC model is of maximal ruggedness with 
for example the highest average number of local maxima among the models 
considered here ^U\, while the Mt. Fuji model is of minimal ruggedness in 
the same sense. Recent studies of empirical data sets, however, imply that 
natural FL's are of intermediate ruggedness 10, 40, 45| . One model which 
allows to tune the degree of ruggedness of the resulting FL is the 'Rough Mt. 
Fuji' (RMF) model which is a variety of the HoC model with a superimposed 
linear trend towards a reference sequence (Tq. The fitness of a sequence <r in 
the RMF model is given by 

-FRMF(cr) = Tja. - cd((T, (T Q ), (2) 

where the r]^ are a family of i.i.d. RV's and c > is a constant [10,46 . If 
c — > oo, the random contributions clearly play no role and the landscape is 
of the Mt. Fuji type. On the other hand, for c = 0, eq. ([2]) is simply the HoC 



model. The RMF model can be seen as a HoC model (or, equivalently, REM) 
with a superimposed field of strength c favoring one allele in each position 
(that corresponding to the reference sequence <r ) over the other. 

However, the RMF model is just a phenomenological model introduced to 
study the effect of intermediate ruggedness and the 'fitness field' of strength 
c has no clear biological basis. A model where such intermediate degrees of 
ruggedness arise for biologically motivated reasons is the LK model proposed 
by Kauffman and Weinbergecl [If] • This model, which will be introduced in 
the next section, is the main focus of this paper. 



2.2 The LK model 

2.2.1 Definition 

Genes, i.e. parts of the genome that code for a protein [TT], can be considered 
as the fundamental building blocks of the genetic code. However, since the 
resulting proteins can only perform their function when working together 
with proteins coded for by other genes, a mutation changing the protein 
coded for by any given gene A also affects the function of all genes that 
rely on the protein produced by gene A: The organism will still transcribe 
and translate the other genes, thus incurring the cost of production for the 
corresponding proteins, but if they cannot be put to function due to lack of 
the protein produced by gene A, that effort is wasted. Thus if some gene B 
has a beneficial effect provided that gene A produces its protein, a mutation 
to gene A may render the presence of gene B deleterious, giving rise to so- 
called epistatic interactions [71148] . 

In an attempt to capture such interactions between genes, Kauffman and 
collaborators devised a model 20 ( 47 within which each site I — 1,2, ... ,L 
is assigned a set of interaction partners V\ = {I, fix, v^2, ■ ■ ■ vi,k}i consisting 
of indices of K other sites, see fig. [H note that site I itself belongs to the 
'neighborhood' vi. In the original formulation of the model, two ways for 
assigning these interaction partners were considered: Either K adjacent sites 
are chosen (e.g., the K/2 sites on either side of I or the K sites following 
I), or the K sites are chosen uniformly at random, making sure that no site 
appears more than once in a given neighborhood; for other possibilities see 
22,49,50,51 . Most features of the LK- model appear to be rather robust 
with respect to the way in which the interaction partners are assigned. Thus 
in the following, only the case of randomly chosen interaction partners will 
be considered. 

The model is generally known as the 'NK model'. Here we follow the convention 
of most population genetic literature in reserving the letter N for population size 
and designating the number of loci by L. 



L 



Fig. 2 An example of the interaction partners of a site i for K = 2. The sites h 
and j interact with the site 



■ i 



The fitness of a sequence cr is now defined as a sum over the individual 
site contribution^, 

L 

F LK {a) = Y,Ii{^j}jeu l ). (3) 

l=i 

The single site contributions /;(•) are taken to be identically distributed RV's, 
chosen independently for each of the 2 K+1 possible arguments. In numerical 
simulations, this is realized by generating L tables containing 2 K+1 i.i.d. 
RV's for each of the possible states of the L binary sequences of interaction 
partners of length K + 1. In the simulations reported here Gaussian RV's 
were used. 

The definition of the model implies that a mutation at a site i of a se- 
quence changes the fitness contributions fj(-) of all sites j that have site i 
among their interaction partners. This means that the case K = L — 1, in 
which every site interacts with all the others, corresponds to the HoC model: 
Each mutation changes all fitness contributions in the sum @ . On the other 
hand, K = means that the set of interaction partners of a given site i 
consists only of that site itself. Then, provided that the fitness values are 
drawn from a continuous distribution, one of the two fitness values for the 
two states of site i will be greater than the other and for any suboptimal 
configuration, there will be point mutations that increases fitness. Thus for 
K = 0, all L\ paths to the (unique) global fitenss maximum will be accessible. 
This limit corresponds to the Mt. Fuji model. Any intermediate value of K, 
< K < L — 1, corresponds to an intermediate degree of ruggedness. 

2.2.2 Relation to p-spin models 

It is a well-known fact that any fitness landscape F(cr) defined on the L- 
dimensional binary hypercube can be written as a linear combination of 
products ai 1 a i2 . . .a ip with 1 < i\ < ■ ■ ■ < i p < L and < p < L [22,521153]. 
This implies that each term in the sum (|2]) can be written as a linear combi- 
nation of terms describing the interactions among up to K + 1 loci. Since the 
same argument applies to the sum as a whole, the LK-mo&e\ can be viewed 
as a superposition of p-spin spin glass models [431144] with p < K + 1 [54] . 
However, in contrast to the fully connected p-spin model, where all possible 
p-spin interactions occur with nonzero coefficients, the interaction graph of 
the LK-mode\ is generally sparse, at least when the limit L — > oo is taken at 
fixed K (see below). Note, in particular, that when K = 1 and the interaction 

6 Often the sum on the right hand side of ([3]) is scaled by L _1 or L~ x ' 2 in order 
to guarantee the existence of the limit L — >• oo. Here only the rank ordering of 
fitness is of interest and the overall magnitude of F is irrelevant. 
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Fig. 3 (Color online) Numerical study of the full distribution 7ri,(n) of accessible 
paths for L — 9 and K = 8,7,6. Note that while for n > 1, the curves show 
a roughly exponential decay (linear in the logarithmic scales used here), n = 
carries much more weight than expected by extrapolation of the exponential. The 
inset shows how the corresponding probability -kl (0) behaves as a function of L for 
the HoC model (K — L — 1). Unless indicated otherwise, the data shown in this 
and subsequent figures is an average over 10 5 realizations 



partners are chosen to be adjacent sites along the sequence, the LK-mode\ 
is closely related to the one-dimensional Ising spin glass |55j . 



3 Evolutionary accessibility in the LK model 

Previous work on the LK-mode\ has focused largely on local measures of 
ruggedness, in particular on the density of local fitness maxima 56, 57,58. 59 
(see also 51,60 for studies of adaptation dynamics on LK landscapes). Here 
we present results for the two global measures of evolutionary accessibility 
introduced above, the number of accessible paths and the size of the basin 
of attraction of the global fitness maximum. Since the biologically relevant 
limit of genome lengths on the order of millions or billions of sites is beyond 
the reach of empirical or numerical exploration, a key issue is to identify the 
asymptotic behavior of these quantities in the limit of large L. For this pur- 
pose one needs to specify how the number of interaction partners K changes 
with L. We will consider three possible scalings for K: 

• L - K fixed, i.e. K « L for L > 1, 

• K/L fixed, i.e. K ex L and 

• K fixed. 



3.1 Accessible paths 

Figure [3] shows an example of the distribution of the number of accessible 
paths, 7Tx,(n), for a fixed sequence length L = 9 and various values of K. One 



11 




0.001 



0.0001 - 



Fig. 4 (Color online) Distribution of accessible paths for K = 1 and several values 
of L as a function of n/L\. This scale was chosen in order to facilitate comparison 
between different values of L. Note that the distribution is dominated by peaks 
that seem to persist as L increases, with major peaks occurring at the same value 
of x = n/L\. In the data for L = 6, only every third point is shown. This has no 
influence on the structure and only serves to render the plot less obfuscated 





Fig. 5 (Color online) 7r z,(n) as function of n/L\ for K — 2 (left) and K = 3 (right) 
and different values of L. While the individual peaks are still discernible for small 
values of L, they become less pronounced as L grows 



sees that the distribution decays roughly exponentially for large values of n 
but has much probability weight on configurations with no accessible paths 
at all. This implies that considering the expected number of accessible paths 
(til) does not suffice to characterize the distribution and one needs also to 
study the probability 717,(0) that in a given FL, there are no accessible paths 
at all [MTU] . An example for the behavior of this probability as a function of 
L can be seen in the inset of fig. [3] 

3.1.1 Full distribution for fixed K : Combinatorial structure 



We now consider in more detail how the shape of the distribution 717,(71) 
changes with K. For K = 0, the distribution of accessible paths is clearly 
given by ^L.K=o( n ) = $n,L\ since all paths are necessarily accessible. For 



12 



K = 1, a pronounced peak at n = L\ is still present, but the distribution 
is already dominated by realizations with no accessible paths, see fig. 01 For 
K = 2, the peak structure is still present for L — 4 but as L increases, 7rz,(n) 
becomes smoother, see left panel of figEJ This can be seen even more clearly 
for K = 3, where the curve for L = 6 shows no peak structure at all anymore. 

The case K = 1 seems to play a special role in this context, as the peaks 
remain at the same positions (relative to the total number L\ of possible 
paths) and the peak structure is visible for all sequence lengths for which 
the numerical simulations were performed. Moreover, closer inspection of the 
distribution reveals that certain values of the number of accessible paths n 
are forbidden, in the sense that for these values 7rt(n) = 0. In particular, the 
peak at n = L\ is separated from the rest of the distribution by a gap An 
which increases with increasing L. 

While the intriguing combinatorial structure seen in 7Tl(7i) for K = 1 
so far eludes our analytic understanding, the computation of An is rather 
straightforward, as it does not depend on the details of the LK-mode\. To 
see this, we consider a fitness landscape in which all L\ paths are accessible, 
and ask for the minimal number of pathways that become inaccessible by 
reversing the fitness difference along a single nearest-neighbor bond. The 
number of paths passing through a bond connecting genotypes at HD k and 
k + 1 from the GM is k\(L — k — 1)!, which becomes minimal at k = \L/2\ , 
thus An = [L/2\\(L — [L/2\ — 1)!. This gap is present for any value of K, 
but it is relevant only for K = 1, where the fully accessible state n — L\ still 
occurs with observable probabilitjQ. 

3.1.2 Expected number of accessible paths 

Despite the complexity of the distribution of accessible paths, some simple 
analytic results are available for the expected value (nr) [10,61,62 . Note 
first that, by linearity of the expectation, 

L\ 

(n L ) = ^Pfpath k accessible] = L!P[path k accessible] = L\P L (4) 

fe=i 

where Pl is the probability for a given path to be accessible (all paths are 
equivalent if one averages over realizations of the FL). For the HoC model, Pj, 
is the probability that L i.i.d. RV's (the fitness values encountered along the 
path) occur in increasing order. There is only one such ordering among the 
LI possible permutations, and thus in the HoC case Pl = 1/Ll and (n/,) = 1 
independent of L. In the case of the RMF model ([2]) it can be shown that Pj, 
decreases exponentially (rather than factorially) with L for c > 0, implying 
that the expected number of accessible paths grows without bounds 62.. 
Thus with respect to an external field, there is a clear distinction between 
the maximally rugged case of field strength c = and any positive field. 

A similar distinction is observed numerically between the L-dependence 
of (n^ when L — K is kept fixed and the other two scalings, as can be seen 

7 Note that, in contrast to the peaks seen in fig. [31 the gap does not live on the 
scale L\, since An/ L\ — > for L —5- oo. 
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Fig. 6 (Color online) Expected number of accessible paths as function of K for 
various values of L. The horizontal lines indicate that for L — K = 2, 3 and 4, 
respectively, {til) is almost constant, similar to the HoC case (L — K = 1) where 
(til) = 1 exactly. In the inset, the points for L/K = 2 (lower line) and L/K — 3 
(upper line) are connected by functions of the form 0.3e^ K with ft = 1 for the lower 
and ft = 2 for the upper line. This indicates that the average number of accessible 
paths grows exponentially with L when L/K is kept fixed 



in fig. [6] In this plot, the expected number of paths is depicted as a function 
of K for various values of L. The last point for each value of L corresponds 
to K = L — 1. These points all lie on the line (til) = 1, confirming the HoC 
result. The second to last points correspond to L — K = 2 and so on. While 
the data for L — K > 2 do not lie strictly on a line of constant (til}, they 
increase very slowly with K. On the other hand, data points belonging to 
a fixed value of K are almost equidistant along the ordinate in the semi- 
logarithmic plot, implying exponential growth. Similarly, if the ratio L/K is 
kept fixed (for example at L/K = 2 or L/K = 3, see inset of fig. [6]) a roughly 
exponential or slightly super-exponential growth of (til) with L is observed. 



3.1.3 Probability of finding no accessible path 



As was mentioned above in connection with the first numerical example of 
7Tx,(n) in fig. [3l in addition to the expected number of accessible paths it is 
useful to consider also the probability itl (0) that there is no accessible path. 
The complementary probability 1 — ttl (0) is the probability that there is at 
least one accessible path to the global fitness maximum, which can be viewed 
as an overall measure of accessibility [91IT0] . 

For the RMF model defined in eq. ([2]), numerical simulations show that 
(similar to the expected number of accessible paths) the cases c = and c > 
behave very differently for large L [ID] . In particular, for c > 0, 7Tl(0) is a 
non-monotonic function of L which declines for large L, indicating increasing 
accessibility, while for the HoC case c = 0, ttl (0) increases monotonically with 
L and appears to approach the limit limi,-^ 7Ti(0) = 1. 
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Fig. 7 (Color online) Left: The probability 7tl(0) of no accessible path as a func- 
tion of L, keeping L — K fixed. The curve K = L — 1 corresponds to the HoC case. 
All three curves increase for the entire range of L-values explored (up to L = 20 
for the HoC case, not shown here). Right: 7tl(0) at fixed ratio K/L 



The left panel of fig. [7] shows the behavior of 7Tl(0) for the LK- model at 
fixed values of L — K. The curves increase monotonically in a very similar 
fashion. The curve for K = L — 1 represents the HoC case which was studied 
in more detail in [TU] . where simulations showed that it remains monotonic 
at least up to L = 20. The right panel of fig. [7] shows the behavior of 7r L (0) 
if the fraction of interaction partners K/L is kept fixed. These curves show 
non-monotonic behavior with an increase up to a certain value of L and a 
subsequent decrease. Thus the simulations indicate that for large L, there 
will with high probability (possibly with probability 1) be accessible paths 
in such a FL. 



The results shown so far suggest a relatively simple picture, which matches 
that found previously for the RMF model [TU] . Increasing L at fixed L — K 
one observes HoC-like behavior, with accessibility (measured by 1 — ttl(0)) 
decreasing monotonically with L. On the other hand, when the fraction of 
interacting sites is kept fixed, accessibility increases with L, and decreases 
(for large enough L) with increasing K/L, indicating that the parameter 
K/L plays a role similar to the field c in the RMF model. 

However, the data depicted in fig. [5] for the behavior of 7Tl(0) at fixed K 
reveal a more complex scenario. The results for K > 5, which were already 
reported in [TU], show that, as expected, 7Tl(0) decreases monotonically with 
L, a behavior that is seen to extend also to K — 4 and K = 3. In striking 
contrast, the curve for K = 2 seems to be almost independent of L, and even 
more surprisingly, for K = 1 7Ti,(0) increases with L. This non-monotonic 
-ftT-dependence of the accessibility measure 1 — 7Tl(0) is highly unexpected 
and cannot be inferred, e.g., from the results for the expected number of 
accessible paths. Together with the characteristic differences in the shape 
of the distribution 7r L (n) (see above) and the corresponding distribution of 
basin sizes (see below), the results shown in fig. |S] provide a strong indication 
for a qualitative change in the LK fitness landscapes at K = 2. We will 
return to the possible reasons for this puzzling behavior below in sect. [4] 
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Fig. 8 The probability tvl(O) that a given realization of a FL has no accessible 
paths at all as function of the sequence length. Note that while the curves for 
K > 3 decrease with L, the curve for K — 1 actually increases for the range of L 
considered here. The error bars are due to a lower number of samples for larger 
systems (10 3 or 10 4 ) 

3.2 Basin of attraction of the global maximum 

Under greedy dynamics, the configuration space can be decomposed into dis- 
joint BoA's, each belonging to one specific local optimum. Thus the expected 
size of a typical BoA, {b typ ), can be estimated by dividing the total number 
of states by the expected number of optima (N opt } [55] . For the HoC model, 
the average number of local optima can easily be computed: Since a local 
optimum has to have a greater fitness than each of its L neighbors [which 
happens with probability 1/(L + 1)], the expected number of local optima is 
(N opt ) = 2 L /(L + 1) and henc<fl 

*->-}S5>-i+i- (5) 

Since asymptotic results for the number of local maxima in the LK model 
exist [56,57,58,59 , one can straightforwardly obtain analogous estimates for 
(btyp) in this case. However, this only gives a lower bound to the size of (bcAi) 
of the BoA of the GM which grossly underestimates the actual behavior and 
will not be presented here. 

One reason for the difference in behavior between (bt yp ) and (&ga/) is 
that a typical, i.e. only local maximum has to compete for each of its nearest 
neighbor states with another local optimum, whereas the GM does not have 
to compete for its nearest neighbors: By definition, for each neighboring 
sequence the GM is the state of highest fitness in its neighborhood. Thus the 
BoA of the GM comprises at least L + 1 configurations, whereas the BoA's 
of typical maxima can be (and often are) smaller, see fig. [1] for illustration. 
Numerical simulations of the distribution Pl^gm of the BoA for the HoC 

8 Note that this estimate relies on uniformly sampling all local maxima. The size 
of a typical basin found by a greedy walk starting at a random position will be 
larger, because this introduces a bias towards larger basins. 
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Fig. 9 (Color online) Distribution of the size of the basin of attraction of the 
GM for the HoC model as function of the rescaled basin size b/2 L . As the sequence 
length L increases, the distribution becomes more concentrated near the minimum 
value b m i n = L + 1. Inset shows the same data in double- logarithmic scales 
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Fig. 10 (Color online) Size distribution of the GM's BoA for K — 1 and different 
values of L as function of the fraction of state space b/2 L . Note that the spikes 
dominating the distribution remain visible and occur at the same fraction of b/2 L 
for all values of sequence length considered 



model show that as L grows, this distribution becomes more concentrated 
around the minimal value L + l, see fig. El Since the HoC model is maximally 
rugged, the question remains whether the BoA of the GM comprises a finite 
fraction of the entire state space for the Li^-model with K < L — 1. Before 
addressing this question, the full size distribution of the BoA of the GM will 
be studied. 

For K = 1, the distribution of the GM's BoA shows a striking discrete 
structure with spikes that remain at fixed values of b/2 L as L grows. As can 
be seen in fig. 1 101 the most important contribution comes from realizations 
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Fig. 11 (Color online) Size distribution of the BoA of the GM for K = 2 (left) and 
K — 3 (right) for various sequence lengths as function of the fraction of sequence 
space. Note that unlike the case K = 1, here the discrete structures vanishes as L 
grows 
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Fig. 12 (Color online) The left panel shows expected size of the BoA of the GM as 
a function of L for various values of K on semi- logarithmic scales. The growth with 
L is exponential, as indicated by the straight lines. For comparison the full volume 
of state space 2 L is also shown. The right panel shows the exponential growth rate 
a of the expected BoA size as function of K 



in which the entire state space is in the BoA, followed by contributions from 
realizations with exactly 3/4 of the state space in the BoA. Similar to the 
distribution of the number of accessible paths for K = 1 shown in fig. |H the 
fact that the positions of dominant peaks remain at the same fraction of state 
space indicates that the effect causing these 'spectra' is commensurate (in 
some sense) with the total size of sequence space. Some amount of discrete 
structure remains visible for K = 2 and 3, but as can be seen in fig. [TTJ 
this appears to be a finite-size effect which becomes less prononouced as L 
increases. 

Figure [T2l shows in the left panel the expected size of the GM's BoA (ben) 
as a function of the sequence length L. Clearly (bcM) grows exponentially, 
{boAi) ex exp(ai) with In 2 > a > 0.5 depending on K (right panel of fig. li"2"]) . 
Since a < In 2 the fraction of state space in the BoA of the GM appears 
to approach zero asymptotically. Similar to the results for the number of 
accessible paths shown in fig. [6j we see that (bcM{L)) grows very slowly 
with L when L is increased at fixed L — K. 
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Fig. 13 (Color online) Distribution of the GM's BoA for the RMF and LisT-models 
with L = 9 and different values of c and K as indicated. The value c = 0.5 for the 
RMF model has been chosen to match the distribution of basin sizes as obtained 
fro m t he L K model. Such a match is clearly not possible for K — 1 and 2, see figs. 
1101 and [TT1 Decreasing c to the value 0.2 moves the RMF distribution close to the 
HoC distribution shown in the inset of fig. [9] 



4 Conclusions 



In this paper, we investigated the evolutionary accessibility of fitness land- 
scapes derived from Kauffman's LK model. Accessibility of these landscapes 
was quantified in terms of topographic properties that govern the adapta- 
tion in the strong selection/weak mutation regime (accessible paths) and the 
greedy regime of adaptation (basins of attraction). The LK model allows to 
tune the number of interactions between genetic loci and thus the rugged- 
ness of the landscape. However, it is not a priori clear how this parameter K, 
which governs the interaction strength, should behave as function of sequence 
length L. Therefore three different scalings of K with L were considered. 

In the case of fixed difference L — K = const., the accessible paths were 
found to behave quite similar to the totally rugged case K = L — 1 for large L. 
However if the fraction K/L is kept fixed, the landscapes become smoother 
in the large L limit in the sense that the expected number of accessible paths 
grows (possibly even super-exponentially, see fig. |5|) and that the probability 
of finding a realization without any accessible paths decays, see the right 
panel of fig. [3 

In the case of fixed K, the behavior of the mean number of accessible 
paths (ni) was found to be in accordance with the naive expectation that 
the resulting FL should be very smooth since K/L — > as L — > oo, see fig. 
H3 However the probability of no accessible paths ttl (0) showed a surprising 
behavior. For K > 3, ttl(0) declines monotonically with L and the curves 
are very similar. For K = 1, however, 7Tl(0) increases with L. In the case 
K = 2, 717,(0) seems almost independent of L, thus this appears to be a 
marginal case separating the two regimes. This was not expected because 
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K = 1 is the (non-trivial) case with the least amount of interactions and was 
thus expected to be smoother than those for larger K. 

This difference in behavior of K = 1 and K > 3 is also reflected in the 
full distribution 7rx,(n) shown in figs. [4] and [5] In one case, 77^(71) remains 
dominated by characteristic peaks whereas for K > 3, the curves become 
smoother with increasing sequence length eventually showing the behavior 
found in fig. [3J for the HoC case K = L — 1. 

The full distribution of the BoA of the GM under greedy dynamics also 
showed a characteristic spike structure for K = 1, see fig. [TUJ As for the 
distribution of the number of accessible paths, the structure remained essen- 
tially unchanged for the sequence lengths considered. Under rescaled vari- 
ables b/2 L , the dominant spikes retain their positions, which indicates that 
the reason for these dominant features of the distribution is commensurate 
with the total number of states 2 L . 

At this point we can only speculate about the reasons for the marked 
change in behavior that seems to occur in the LK model around K = 2. We 
have noted above that the LK-mode\ can be seen as a dilute spin glass with 
p-spin couplings up to p = K + 1. In the SG context it is well known that 
p-spin models behave qualitatively different for p — 2 and p > 3, respectively, 
with regard to static [53] as well as dynamic [641165] properties. It remains 
to be seen if the observations reported here can be understood from the SG 
perspective. 

The RMF model, which was briefly alluded to here, is another model 
which yields tunably rugged fitness landscapes. However, unlike the LK 
model, the distributions obtained for the RMF model show no signs of dis- 
crete structures and, as c — > 0, the curves change smoothly into those from 
the HoC model, see for example fig.lT3"land results presented in [TU]. From a 
fitness landscape point of view it is remarkable that the way in which 'inter- 
mediate ruggedness' arises (by an external 'field' or by internal interactions) 
plays such an important role, while from a spin glass point of view, it is 
perhaps surprising that the two ways of generating intermediate ruggedness 
yield similar behavior at all. 
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